The purpose of this exercise is to perform exploratory data analysis (EDA) on Tweets related to the Shellshock vulnerability.
The data was downloaded using Twitter's REST API and the Tweepy Python module and placed in a simple text file.
Each line in the input file is a JSON-encoded Tweet related to the Shellshock vulnerability.
See Download Tweets for Shellshock.ipynb for the download part.
Most of the data munging is done in Python, but plotting is done using RMagic and ggplot2.
Why? My R/dplyr skills are not as good as my Python/pandas skills, and when it comes to charting nothing beats ggplot2.
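For context, the download in the linked notebook boils down to something like the sketch below. This is a hypothetical reconstruction, not the actual code: the search query, item limit, and credential placeholders are my own assumptions; see the linked notebook for the real thing.
In [ ]:
# Hypothetical sketch of the download step: search for shellshock tweets with
# Tweepy and write each result as one JSON object per line into a gzipped file.
import json
import gzip
import tweepy

# CONSUMER_KEY etc. are placeholders for your own Twitter app credentials.
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

with gzip.open('data/shellshockTweets.txt.gz', 'wb') as out:
    # The query and the 100000 item cap are assumptions, not the original settings.
    for status in tweepy.Cursor(api.search, q='shellshock').items(100000):
        out.write(json.dumps(status._json) + '\n')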
In [ ]:
%matplotlib inline
%load_ext rpy2.ipython
In [ ]:
%%R
library(ggplot2)
library(gridExtra)
library(scales)
library(base)
print('Loaded R libraries')
In [3]:
import pandas as pd
import sys
import json
from datetime import datetime
import gzip
import time
from ggplot import *
import requests as r
from httpcache import CachingHTTPAdapter
from IPython.display import HTML, display_json, \
display_html, display_javascript, JSON, Javascript
import re
import numpy as np
from urlparse import urlparse
# Some Pandas options TODO revisit this later
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth',2000)
# Cache http requests TODO confirm that this works
s = r.Session()
s.mount('http://', CachingHTTPAdapter())
s.mount('https://', CachingHTTPAdapter())
In [4]:
fName = "data/shellshockTweets.txt.gz"
tweets = []
st = time.time()
with gzip.GzipFile(fName, 'r') as fp:
    for tweet in fp:
        t = json.loads(tweet)
        # Parse 'created_at' into a proper datetime.
        t['created_at'] = \
            datetime.strptime(t['created_at'],
                              '%a %b %d %H:%M:%S +0000 %Y')
        tweets.append(t)
et = time.time()
print ('Loaded {0} tweets in {1:.2f} secs'.format(len(tweets),(et-st)))
In [5]:
# TODO pep8ify this code block
st = time.time()
tweetsDF = pd.DataFrame(tweets)
tweetsDF['user_loc'] = tweetsDF['user'].map(lambda u: u['location'])
tweetsDF['tweeted_to'] = tweetsDF['user'].map(lambda u: u['followers_count'])
tweetsDF['screen_name'] = tweetsDF['user'].map(lambda u: u['screen_name'])
tweetsDF['status'] = pd.isnull(tweetsDF['retweeted_status']).map(
    lambda s: 'original' if s else 'retweet')  # Is it a retweet?
# Extract the client name (the anchor text) from the 'source' HTML.
regexp_link = r'''<a .+>(.+)</a>'''
pattern = re.compile(regexp_link)
tweetsDF['source'] = tweetsDF['source'].map(lambda x: pattern.findall(x)[0])
# Hashtags and URLs are buried deep in the JSON;
# extract them and convert them to a space-separated string.
tweetsDF['hashtags'] = tweetsDF['entities'].map(
    lambda x: ' '.join([ht['text'].lower().strip()
                        for ht in x['hashtags']]))
tweetsDF['urls'] = tweetsDF['entities'].map(
    lambda x: ' '.join([u['expanded_url'].strip()
                        for u in x['urls']]))
tweetsDF['urlDomains'] = tweetsDF['entities'].map(
    lambda x: ' '.join(['{uri.scheme}://{uri.netloc}/'.format(
                            uri=urlparse(u['expanded_url'].strip()))
                        for u in x['urls']]))
et = time.time()
# A dedicated DF for unique Tweets.
uniqtweetsDF = tweetsDF[tweetsDF['status'] == 'original']
uniqtweetsDF.reset_index(inplace=True)
# for converting to R DataFrame
cols=['created_at','lang','user_loc','screen_name','retweet_count',
'tweeted_to','status','source','hashtags','urls','urlDomains']
df = tweetsDF[cols]
udf = uniqtweetsDF[cols]
print ('Parsed tweets in {0:.2f} secs and found {1} unique tweets'.format((et-st),len(uniqtweetsDF)))
In [6]:
%%R -i df -i udf
df$urls <- as.character(df$urls)
df$urlDomains <- as.character(df$urlDomains)
df$hashtags <- as.character(df$hashtags)
udf$urls <- as.character(udf$urls)
udf$urlDomains <- as.character(udf$urlDomains)
udf$hashtags <- as.character(udf$hashtags)
print('Loaded DataFrames in R')
In [ ]:
# Sample a few Tweets
from IPython.display import display
sample=tweetsDF.loc[[27,446,9898]]
HTML(sample.T.to_html())
In [27]:
%%R -w 800 -h 800 -u px
plt <- ggplot(df, aes(x=created_at, fill=status, color=status)) +
scale_x_datetime(breaks = date_breaks("1 day"),labels=date_format('%m/%d')) +
scale_y_continuous(labels=comma) +
geom_histogram(alpha=0.5,binwidth=24*60*60,drop=TRUE) +
theme(axis.text.x=element_text(angle=90)) +
xlab('Date') + ylab('Tweets') +
ggtitle('All Tweets related to Shellshock per day\nStacked by Status\n')
plt2 <- plt +
ggtitle('All Tweets related to Shellshock per day\nUnstacked\n') +
facet_grid(. ~ status)
suppressMessages(g <- grid.arrange(plt,plt2,nrow=2))
Some Notes & Observations
Notes
Observations
In [28]:
%%R -w 800 -h 800 -u px
plt <- ggplot(df, aes(x=created_at,color=status)) +
scale_x_datetime(breaks = date_breaks("1 day"),labels=date_format('%m/%d')) +
theme(axis.text.x=element_text(angle=90)) +
xlab('Time') + ylab('Density') +
ggtitle('All Tweets related to Shellshock per day\n')
plt1 <- plt +
geom_density(alpha=0.2,aes(fill=status),drop=TRUE)
plt2 <- plt +
geom_histogram(alpha=0.2,aes(y = ..density..,fill=status),drop=TRUE) +
geom_density(drop=TRUE) +
facet_grid(. ~ status)
suppressMessages(grid.arrange(plt1,plt2,nrow=2))
In [15]:
top_N_lang=df[df['lang'].isin(
pd.value_counts(df['lang'])[:6].index)]
In [16]:
%%R -i top_N_lang -w 1024 -h 480 -u px
plt <- ggplot(top_N_lang, aes(x=created_at, fill=lang, color=lang)) +
scale_x_datetime(breaks = date_breaks("1 day"),labels=date_format('%m/%d')) +
scale_y_continuous(labels=comma) +
geom_histogram(alpha=0.5,binwidth=24*60*60,drop=TRUE) +
theme(axis.text.x=element_text(angle=90,hjust=1)) +
xlab('Date') + ylab('Tweets') +
ggtitle('All Tweets related to Shellshock per day\n') +
facet_wrap(~ lang,scales='free_y')
plt
Before diving into the charts, let's look at what the top 20 retweeted tweets were.
Looking at the data below, none of the tweets (bar the first one) was retweeted more than a few hundred times at most.
Infosec tweets don't attract the same kind of attention as what the flavor-of-the-month celebrity had for breakfast.
In [17]:
# Top 20 retweeted tweets
top20Retweeted = uniqtweetsDF.sort(columns='retweet_count',ascending=False)[:20].reset_index()
# Display Top 20 Retweeted tweets using Twitter's oembed API.
HTML('<div><h3>Top 20 Retweeted Tweets</h3><br/>' + \
"".join(['<div> Retweeted '+"{:,}".format(row['retweet_count'])+' times:' + \
r.get('https://api.twitter.com/1/statuses/oembed.json?id='+ \
row['id_str']+'&align=center&omit_script=true').json()['html']+ \
'</div>'
for row in top20Retweeted[['id_str','retweet_count']].T.to_dict().values()]) + \
'</div>')
Out[17]:
In [18]:
%%R -i top20Retweeted -w 800 -h 600 -u px
labs <- paste(rev(reorder(top20Retweeted$screen_name,top20Retweeted$retweet_count)) ,
rev(reorder(top20Retweeted$tweeted_to,top20Retweeted$retweet_count)), sep=':')
plt <- ggplot(top20Retweeted, aes(x=reorder(id_str,retweet_count), y=retweet_count,fill=-tweeted_to)) + coord_flip() +
scale_y_continuous(labels=comma) +
scale_x_discrete(labels=labs) +
geom_bar(alpha=0.8, stat='identity') +
xlab('User:Followers') + ylab('Times Retweeted') +
ggtitle('Top 20 retweeted tweets\n')
plt
In [7]:
# Helper function to plot hashtags, urls.
def getAttrFreq(df, listKey, attrKey):
    all_attrs = pd.Series([attr for attrs in df[listKey]
                           for attr in attrs.split(" ")])
    all_attrs = all_attrs[all_attrs.map(len) > 0]  # Discard blank ones
    all_attrs.reset_index(drop=True, inplace=True)
    freqDF = pd.value_counts(all_attrs).reset_index()
    freqDF.columns = [attrKey, 'count']
    return freqDF

def getKeyFreq(df, key):
    # Count values of the given column (use the passed-in df, not the global tweetsDF).
    freqDF = pd.value_counts(df[key]).reset_index()
    freqDF.columns = [key, 'count']
    return freqDF
In [8]:
%%R -w 800 -h 600 -u px
plotTopNAttrs <- function(df, label, entity) {
    labs <- rev(reorder(df[, entity], df$count))
    plt <- ggplot(df, aes_string(x=paste('reorder(', entity, ',count)', sep=''), y='count')) +
        coord_flip() +
        scale_y_continuous(labels=comma) +
        scale_x_discrete(labels=labs) +
        geom_bar(alpha=0.8, stat='identity') +
        xlab(entity) + ylab('Count') +
        ggtitle(paste('Top ', length(df[, entity]), ' ', label, '\n', sep=''))
    return(plt)
}
In [27]:
# Plot Top 20 hashtags in Tweets (including retweets)
top_20_hashtags= getAttrFreq(tweetsDF, 'hashtags', 'hashtag')[:20]
In [28]:
%%R -i top_20_hashtags -w 800 -h 600 -u px
plotTopNAttrs(top_20_hashtags,'hashtags','hashtag')
In [29]:
# Plot Top 20 hashtags in unique Tweets.
top_20_hashtags= getAttrFreq(uniqtweetsDF, 'hashtags', 'hashtag')[:20]
In [30]:
%%R -i top_20_hashtags -w 800 -h 600 -u px
plotTopNAttrs(top_20_hashtags,'hashtags','hashtag')
In [39]:
# Tweets and Retweets by Source
top_20_sources = getKeyFreq(tweetsDF, 'source')[:20]
In [42]:
%%R -i top_20_sources -w 800 -h 600 -u px
plotTopNAttrs(top_20_sources,'sources','source')
Looking at the chart above
Next we look at the top 20 user locations.
Note: A user location is the location a user puts in their profile; this is not the same as a tweet's geolocation. So if someone puts 'London, UK' as their location but actually tweets from New York, NY, it would still count as a UK tweet for this graph.
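To make the distinction concrete, here is a minimal sketch (using the tweets list parsed above) of where each piece of information lives in the Tweet JSON:
In [ ]:
# Minimal sketch: profile location vs. tweet geotag for one parsed tweet.
t = tweets[0]                        # any tweet from the list loaded earlier
profile_loc = t['user']['location']  # free-form text the user typed, e.g. 'London, UK'
geotag = t.get('geo')                # None unless the tweet itself was geotagged
print('profile location: {0!r}, geo: {1!r}'.format(profile_loc, geotag))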
In [49]:
# Tweets and Retweets by User Locations
# Note these are locations set in a User Profile, not Geo Locations associated with a Tweet.
top_20_userLocs = getKeyFreq(tweetsDF, 'user_loc')[1:21]
# TODO Figure out if we can fold variants like 'NY', 'New York', 'New York, NY' into a single bucket.
In [50]:
%%R -i top_20_userLocs -w 800 -h 600 -u px
plotTopNAttrs(top_20_userLocs,'User Locations','user_loc')
Frankly this was a bit of a surprise: the UK, and especially London, leads the pack.
There is some data clean-up required here; for example 'London' and 'London, UK', or 'New York' and 'New York, NY', should be combined. But regardless, London, UK is still ahead of the other locations by a fair margin.
I had no idea London was such a hotbed for the infosec crowd.
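A rough idea of how that clean-up could look. This is only a sketch: the LOC_ALIASES table and the normalize_loc helper are my own, not part of the notebook above.
In [ ]:
# Hypothetical sketch: collapse common variants of the same place into one bucket.
LOC_ALIASES = {
    'london': 'London, UK', 'london, uk': 'London, UK',
    'new york': 'New York, NY', 'ny': 'New York, NY', 'new york, ny': 'New York, NY',
}

def normalize_loc(loc):
    if not loc:
        return ''
    key = loc.strip().lower().rstrip('.')
    return LOC_ALIASES.get(key, loc.strip())

tweetsDF['user_loc_norm'] = tweetsDF['user_loc'].map(normalize_loc)
top_20_userLocs = getKeyFreq(tweetsDF, 'user_loc_norm')[1:21]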
In [22]:
# Plot Top 20 URLS in All Tweets.
top_20_urls= getAttrFreq(tweetsDF, 'urls', 'url')[:20]
ggplot(top_20_urls, aes(x='factor(url)', y='count')) + \
geom_histogram(stat='identity',alpha=0.8) + scale_y_continuous(labels='comma') + \
theme(axis_text_x=element_text(angle=45, hjust=1)) + xlab('URL') + \
ylab('Count') + ggtitle('Top 20 URLs in Tweets related to Shellshock\n')
Out[22]:
I would not have expected cnet.com to be a leading cited source for infosec/vulnerability data. But there is a reason for this: that cnet URL is the one mentioned in the most retweeted tweet (the one that got retweeted ~11K times), hence it ends up as the leading URL across all the tweets.
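A quick way to confirm that, as a sketch against the DataFrames built above (not part of the original analysis):
In [ ]:
# Sanity check: the single most retweeted original tweet should carry the cnet link.
top1 = uniqtweetsDF.sort(columns='retweet_count', ascending=False).iloc[0]
print('{0} retweets, URLs: {1}'.format(top1['retweet_count'], top1['urls']))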
Next we look at URLs in only original tweets for a comparison.
In [23]:
# Plot Top 20 URLS in Original Tweets.
top_20_urls= getAttrFreq(uniqtweetsDF, 'urls', 'url')[:20]
ggplot(top_20_urls, aes(x='factor(url)', y='count')) + \
geom_histogram(stat='identity',alpha=0.8) + scale_y_continuous(labels='comma') + \
theme(axis_text_x=element_text(angle=45, hjust=1)) + xlab('URL') + \
ylab('Count') + ggtitle('Top 20 URLs in Tweets related to Shellshock\n')
Out[23]:
As you can now see, the cnet URL is nowhere to be found if you only take into account the original tweets. Here the more traditional infosec-focused websites take the lead, with the notable exceptions of the BBC and Wired.
In [31]:
# Plot Top 20 URL domains in all Tweets (including retweets).
top_20_domains= getAttrFreq(tweetsDF, 'urlDomains', 'domain')[:20]
In [32]:
%%R -i top_20_domains -w 800 -h 600 -u px
plotTopNAttrs(top_20_domains,'domains','domain')
Nothing of much interest here, just a bunch of URL shortening services. One possible follow-up, sketched below, would be to resolve those shortened links to their final targets so they can be grouped with the sites they actually point at.
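This is only a sketch under my own assumptions: it reuses the cached requests session s created at the top of the notebook, and resolve_url is a hypothetical helper, not something from the original analysis.
In [ ]:
# Hypothetical helper: follow redirects from a shortener (e.g. bit.ly) to the final URL.
def resolve_url(short_url):
    try:
        return s.head(short_url, allow_redirects=True, timeout=5).url
    except r.RequestException:
        return short_url  # fall back to the original link on any HTTP error

# Example: resolve every URL found in the first tweet's 'urls' column.
resolved = [resolve_url(u) for u in tweetsDF['urls'].iloc[0].split() if u]
Next we look at URL domains in only the original tweets for a comparison.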
In [33]:
# Plot Top 20 URL domains in original Tweets.
top_20_domains= getAttrFreq(uniqtweetsDF, 'urlDomains', 'domain')[:20]
In [34]:
%%R -i top_20_domains -w 800 -h 600 -u px
plotTopNAttrs(top_20_domains,'domains','domain')
In [53]:
top_20_users = getKeyFreq(tweetsDF, 'screen_name')[:20]
In [55]:
%%R -i top_20_users -w 800 -h 600 -u px
plotTopNAttrs(top_20_users,'users','screen_name')
This needs more drilling into: apparently a lot of accounts were repeatedly tweeting about 'shellshock'. I need to find out whether these were unique tweets or the same thing being tweeted over and over again.
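One way to start that drill-down, sketched with the columns built earlier; the 50-tweet cutoff is an arbitrary choice of mine.
In [ ]:
# For each account, compare how many shellshock tweets it sent vs. how many
# distinct texts it used; heavy tweeters with few distinct texts are repeaters.
grp = tweetsDF.groupby('screen_name')['text']
per_user = pd.DataFrame({'tweets': grp.size(),
                         'unique_texts': grp.apply(lambda t: t.nunique())})
heavy = per_user[per_user['tweets'] > 50]
print(heavy.sort('unique_texts').head(10))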
In [56]:
import nltk
from nltk.corpus import cmudict
ENGLISH_STOPWORDS = set(nltk.corpus.stopwords.words('english'))
NON_ENGLISH_STOPWORDS = set(nltk.corpus.stopwords.words()) - ENGLISH_STOPWORDS
# for now only unique tweets in English language
uniqtexts = uniqtweetsDF.query('lang == "en"')['text']
t = nltk.tokenize.WhitespaceTokenizer()
p = re.compile(r'[^-#A-Za-z0-9]')  # get rid of unwanted characters
tokenLists = uniqtexts.map(
    lambda text: [p.sub('', word.strip().lower())
                  for word in t.tokenize(text)
                  if (word.lower() not in ENGLISH_STOPWORDS) and
                     (len(word.strip()) > 1)])
all_tokens = pd.Series([token
                        for tokenList in tokenLists
                        for token in tokenList
                        if (token is not None) and (token != '')])
In [58]:
# Plot the top 20 tokens.
top_N_words = pd.value_counts(all_tokens)[:20].reset_index()
top_N_words.columns=['token','count']
In [59]:
%%R -i top_N_words -w 800 -h 600 -u px
plotTopNAttrs(top_N_words,'Words','token')
In [32]:
ggplot(uniqtweetsDF.query('(retweet_count >0)').reset_index(),
aes(x='retweet_count', color='factor(status)', fill='factor(status)')) + \
geom_density(alpha=0.2) + scale_x_log10() + \
theme(axis_text_x=element_text(angle=45, hjust=1)) + \
xlim(low=1) + xlab('Times Retweeted') + ylab('Density') + \
ggtitle('Density Plot of Retweet Count of Each Original Tweet')
Out[32]:
The plot below shows the density of potential viewers of each tweet.
NOTE: This is based only on the immediate number of followers of a user; it does not include followers of followers, etc.
As can be seen, most tweets were seen by fewer than ~10K users, again confirming that the infosec crowd is not as popular as celebrities.
In [31]:
ggplot(tweetsDF.query('(tweeted_to > 10)').reset_index(),
aes(x='tweeted_to', color='factor(status)', fill='factor(status)')) + \
geom_density(alpha=0.2) + scale_x_log10() + \
theme(axis_text_x=element_text(angle=45, hjust=1)) + \
xlab('Potential Viewers') + ylab('Density (Blue is retweets)') + \
ggtitle('Density Plot of Potential Viewers of Each Tweet')
Out[31]:
My assumption here was that there should be some positive correlation between the number of followers you have and the number of retweets you'll get.
Surprisingly, what I found was a slight negative correlation, i.e. the lower the number of followers, the higher the number of retweets. The only way this can happen is if you have a low number of followers but those followers themselves have a high number of followers. In fact the density plot of potential viewers (the one above this one) sort of confirms this.
It was still interesting to see this graphed out.
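To put a rough number on that relationship, here is a sketch; it assumes scipy is available, which the notebook above does not otherwise use.
In [ ]:
# Rank correlation between follower count and retweet count for original tweets.
from scipy.stats import spearmanr

sub = uniqtweetsDF.query('retweet_count > 0')
rho, pval = spearmanr(sub['tweeted_to'], sub['retweet_count'])
print('Spearman rho = {0:.3f} (p = {1:.3g})'.format(rho, pval))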
In [76]:
ggplot(uniqtweetsDF.query('(retweet_count>0) & (tweeted_to<500000)').reset_index(),
aes(x='tweeted_to', y='retweet_count') )+ \
geom_point() + scale_y_continuous(labels='comma') + \
theme(axis_text_x=element_text(angle=45, hjust=1)) + \
xlab('Potential Viewers') + ylab('Number of Retweets') + \
scale_y_log10() + ylim(1,2000) + xlim(0,500000) +\
ggtitle('Potential Viewers v/s Retweeted tweets related to Shellshock')
Out[76]:
In [ ]:
tweetsWithLoc=tweetsDF[pd.notnull(tweetsDF['geo'])].reset_index()
# Get rid of geo points with coords [0.0,0.0]
tweetsWithLoc = tweetsWithLoc[tweetsWithLoc['geo'].map(
lambda g: True if ((g['coordinates'][0] != 0)|(g['coordinates'][1] != 0)) else False)].reset_index()
print '\n{0} out of {1} tweets i.e ({2:.2f})% have location data\n'.format(
len(tweetsWithLoc), len(tweetsDF), (100.0*len(tweetsWithLoc)/len(tweetsDF)))
In [ ]:
js_loader = """
function verifyJSLoaded(){
    /*var jsapiLoaded = (typeof google === 'object' && typeof google.maps === 'object');
    console.log("Google API Loaded: " + jsapiLoaded);
    return jsapiLoaded;*/
    return false;
}

function loadScript() {
    if (!verifyJSLoaded()) {
        console.log('Loading Google API.');
        var script = document.createElement("script");
        script.type = "text/javascript";
        script.src = "https://maps.googleapis.com/maps/api/js?sensor=false&libraries=visualization&callback=console.log";
        document.body.appendChild(script);
    }
}
loadScript();
"""
Javascript(js_loader)
In [ ]:
html_template = '<div id="{0}" style="width: 1024px; height: 768px"></div>'
#This is to make sure the JS gets loaded before we try to load the maps
time.sleep(1)
In [ ]:
def gen_javascript(gJSONs, div_id):
    """
    Generates javascript to draw a heatmap with the Google Maps API.
    """
    # Create the Javascript objects which will comprise geoData.
    coords = ',\n '.join(["new google.maps.LatLng(%s, %s)" % tuple(pair)
                          for pair in gJSONs])
    template_jscript = """
    var geoData = [
        %s
    ];
    var map, heatmap;
    function hmap_initialize() {
        var mapOptions = {
            zoom: 1,
            center: new google.maps.LatLng(30.5171, 0.1062),
            mapTypeId: google.maps.MapTypeId.SATELLITE
        };
        map = new google.maps.Map(document.getElementById('%s'), mapOptions);
        var pointArray = new google.maps.MVCArray(geoData);
        heatmap = new google.maps.visualization.HeatmapLayer({
            data: pointArray
        });
        heatmap.setMap(map);
    }
    hmap_initialize();
    """
    return template_jscript % (coords, div_id)
In [ ]:
HTML(html_template.format('map_001'))
In [ ]:
jscript = gen_javascript(tweetsWithLoc['geo'].map(lambda x: x['coordinates']),'map_001')
Javascript(jscript)
In [77]:
# All of the above works in a locally running notebook but not in nbviewer,
# so here's a static image of what the above code generated.
from IPython.display import Image
Image('tweetsOnMap.png')
Out[77]:
In [ ]: